AITopics | data-centric framework

Collaborating Authors

data-centric framework

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Synthetic Clinical Notes for Rare ICD Codes: A Data-Centric Framework for Long-Tail Medical Coding

Vo, Truong, Wu, Weiyi, Ding, Kaize

arXiv.org Artificial IntelligenceNov-19-2025

Automatic ICD coding from clinical text is a critical task in medical NLP but remains hindered by the extreme long-tail distribution of diagnostic codes. Thousands of rare and zero-shot ICD codes are severely underrepresented in datasets like MIMIC-III, leading to low macro-F1 scores. In this work, we propose a data-centric framework that generates high-quality synthetic discharge summaries to mitigate this imbalance. Our method constructs realistic multi-label code sets anchored on rare codes by leveraging real-world co-occurrence patterns, ICD descriptions, synonyms, taxonomy, and similar clinical notes. Using these structured prompts, we generate 90,000 synthetic notes covering 7,902 ICD codes, significantly expanding the training distribution. We fine-tune two state-of-the-art transformer-based models, PLM-ICD and GKI-ICD, on both the original and extended datasets. Experiments show that our approach modestly improves macro-F1 while maintaining strong micro-F1, outperforming prior SOT A. While the gain may seem marginal relative to the computational cost, our results demonstrate that carefully crafted synthetic data can enhance equity in long-tail ICD code prediction.

discharge summary, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2511.14112

Genre: Research Report > New Finding (0.54)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Health Care Providers & Services (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.62)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Add feedback

Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

Su, Hongjin, Sun, Ruoxi, Yoon, Jinsung, Yin, Pengcheng, Yu, Tao, Arık, Sercan Ö.

arXiv.org Artificial IntelligenceJan-18-2025

Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis. The abilities of existing LLMs at such tasks are often hindered by the lack of high-quality agent data from the corresponding environments they interact with. We propose Learn-by-interact, a data-centric framework to adapt LLM agents to any given environments without human annotations. Learn-by-interact synthesizes trajectories of agent-environment interactions based on documentations, and constructs instructions by summarizing or abstracting the interaction histories, a process called backward construction. We assess the quality of our synthetic data by using them in both training-based scenarios and training-free in-context learning (ICL), where we craft innovative retrieval approaches optimized for agents. Extensive experiments on SWE-bench, WebArena, OSWorld and Spider2-V spanning across realistic coding, web, and desktop environments show the effectiveness of Learn-by-interact in various downstream agentic tasks -- baseline results are improved by up to 12.2\% for ICL with Claude-3.5 and 19.5\% for training with Codestral-22B. We further demonstrate the critical role of backward construction, which provides up to 14.0\% improvement for training. Our ablation studies demonstrate the efficiency provided by our synthesized data in ICL and the superiority of our retrieval pipeline over alternative approaches like conventional retrieval-augmented generation (RAG). We expect that Learn-by-interact will serve as a foundation for agent data synthesis as LLMs are increasingly deployed at real-world environments.

large language model, learn-by-interact, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2501.10893

Country: Asia > China > Hong Kong (0.04)

Genre:

Workflow (0.93)
Research Report (0.82)

Industry: Information Technology > Services (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback

A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets

Bojic, Iva, Halim, Josef, Suharman, Verena, Tar, Sreeja, Ong, Qi Chwen, Phung, Duy, Ravaut, Mathieu, Joty, Shafiq, Car, Josip

arXiv.org Artificial IntelligenceMay-26-2023

Low-quality data can cause downstream problems in high-stakes applications. Data-centric approach emphasizes on improving dataset quality to enhance model performance. High-quality datasets are needed for general-purpose Large Language Models (LLMs) training, as well as for domain-specific models, which are usually small in size as it is costly to engage a large number of domain experts for their creation. Thus, it is vital to ensure high-quality domain-specific training data. In this paper, we propose a framework for enhancing the data quality of original datasets. We applied the proposed framework to four biomedical datasets and showed relative improvement of up to 33%/40% for fine-tuning of retrieval/reader models on the BioASQ dataset when using back translation to enhance the original dataset quality.

comprehension dataset, data-centric framework, domain-specific machine

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2023.insights-1.3

2304.00483

Genre: Research Report (0.40)

Industry: Education > Assessment & Standards > Student Performance (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.53)

Add feedback

A Data-Centric Framework for Composable NLP Workflows

Liu, Zhengzhong, Ding, Guanxiong, Bukkittu, Avinash, Gupta, Mansi, Gao, Pengzhi, Ahmed, Atif, Zhang, Shikun, Gao, Xin, Singhavi, Swapnil, Li, Linwei, Wei, Wei, Hu, Zecong, Shi, Haoran, Liang, Xiaodan, Mitamura, Teruko, Xing, Eric P., Hu, Zhiting

arXiv.org Artificial IntelligenceMar-2-2021

Empirical natural language processing (NLP) systems in application domains (e.g., healthcare, finance, education) involve interoperation among multiple components, ranging from data ingestion, human annotation, to text retrieval, analysis, generation, and visualization. We establish a unified open-source framework to support fast development of such sophisticated NLP workflows in a composable manner. The framework introduces a uniform data representation to encode heterogeneous results by a wide range of NLP tasks. It offers a large repository of processors for NLP tasks, visualization, and annotation, which can be easily assembled with full interoperability under the unified representation. The highly extensible framework allows plugging in custom processors from external off-the-shelf NLP and deep learning libraries. The whole framework is delivered through two modularized yet integratable open-source projects, namely Forte1 (for workflow infrastructure and NLP function processors) and Stave2 (for user interaction, visualization, and annotation).

processor, representation, workflow, (17 more...)

arXiv.org Artificial Intelligence

2103.01834

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Illinois (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > Iceland > Capital Region > Reykjavik (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback